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Abstract 

Motivation: 

The flexibility in gap cost enjoyed by Hidden Markov Models (HMMs) is expected to afford them better 
retrieval accuracy than position-specific scoring matrices (PSSMs). We attempt to quantify the effect 
of more general gap parameters by separately examining the influence of position- and composition- 
specific gap scores, as well as by comparing the retrieval accuracy of the PSSMs constructed using an 
iterative procedure to that of the HMMs provided by Pfam and SUPERFAMILY, curated ensembles of 
multiple alignments. 

Results: 

We found that position-specific gap penalties have an advantage over uniform gap costs. We did not 
explore optimizing distinct uniform gap costs for each query. For Pfam, PSSMs iteratively constructed 
from seeds based on HMM consensus sequences perform equivalently to HMMs that were adjusted to 
have constant gap transition probabilities, albeit with much greater variance. We observed no effect of 
composition-specific gap costs on retrieval performance. 

Availability: 

The scripts for performing evaluations are available upon request from the authors. 
Contact: 



yyu@ncbi.nlm.nih.gov 



*to whom correspondence should be addressed 



1 Introduction 



Information retrieval from molecular databases by sequence alignment is an essential component of modern 
biology. The effectiveness of retrieval strategies depends crucially on how alignments are scored. A pairwise 
alignment score typically combines scores for the substitutions, insertions, and deletions that transform one 
sequence into another. Scores for substitutions are derived from a substitution matrix, while scores for 
insertions and deletions are known as gap costs. The importance of gap costs has prompted numerous 
studies proposing various reasonable gap penalty schemes iQi, 28. 6l 351 271. 

Search accuracy may be improved substantially by using position-specific scoring matrices (PSSM) lllSn . 
In addition, it is possible to introduce position- and composition-specific gap costs, which so far have been 
implemented primarily by Hidden Markov Models (HMMs) [22^ JJ. In thispaper we attempt to quantify the 
effect of different gap scores on retrieval performance using PSI-BLAST yj] and HMMER , canonical 
examples of software tools employing PSSMs and HMMs, respectively. 

As its name suggests, a PSSM assigns scores to amino acids in a database sequence based on the posi- 
tion in which they occur in the alignment. PSI-BLAST computes and scores alignments using a heuristic 
approximation to the Smith- Waterman algorithm jjoll with affine gap costs iTll providing uniform penal- 
ties for opening and extending a gap. PSSMs used by PSI-BLAST may be generated through an iterative 
search procedure, or obtained from other sources, such as databases of curated multiple sequence alignments 
(MSAs). 

Two publicly available sources of curated alignments are the Pfam ifioll and SUPERFAMILY 112, 33] 
databases. In both, each MSA is represented by an HMM, which may be used for similarity searches. 
An HMM is a finite-state automaton, characterized by state-to-state transition probabilities and emission 
probabilities that generate hypothetical protein sequences. See Fig. [T]for an example and the Appendix for 
more details. 




Figure 1: An example of a protein profile HMM architecture used by HMMER. The model contains n 
positions plus a begin state (B) and end state (E). Each position contains a substitution (S) and a deletion 
state (D), with a possible insertion state (!) between two 5'-nodes. Allowed transitions are shown by arrows. 
To simulate local alignments, transitions B^Si and Si E, for any Si, are permitted. 



The HMMER package uses the Viterbi algorithm f/^, which finds the highest-scoring sequence of 
states in the HMM that produces the database sequence. The probability that a particular amino acid is emit- 
ted in a HMMER substitution state may be identified with the probability that it occurs in a corresponding 
position in a PSI-BLAST PSSM. On the other hand, HMMER allows position- and composition-specific 
gap scores, which model the probability that an insertion or deletion occurs at a particular position in an 
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alignment. 

With their greater gap cost flexibility, HMMs may be expected to have better retrieval accuracy than 
PSSMs. We attempt to quantify the effect of HMMER's use of more general gap parameters by separately 
examining the influence of position- and composition-specific gap scores. We also compare the retrieval 
accuracy of the PSSMs constructed using PSI-BLAST's iterative procedure to that of the HMMs provided 
by the Pfam and SUPERFAMILY collections. Our results may suggest some directions for improvements 
to PSI-BLAST, and the magnitude of the improvements one might expect. 

We collected protein profile HMMs from Pfam and SUPERFAMILY. We then modified the profiles 
from each source to simulate different retrieval strategies, and used them as queries for HMMER and PSI- 
BLAST to search a set of sequences from the SCOP (structural classification of proteins) database liil lill. 
which forms our 'gold standard'. We use the results of the searches to evaluate and compare the retrieval 
performance of the search methods considered. 

SCOP is a database of protein domains, classified by structure, function and sequence. Protein domains 
are classified into a hierarchy of class, fold, superfamily and family. Domains sharing the same superfamily 
are assumed to be homologous. For our testing purposes, we use the ASTRAL 40 [5] subset of SCOP 
(release 1.71), consisting of domain sequences that were filtered so that no two sequences share more than 
40% pairwise identity. ASTRAL has been used as the testing set in a number of performance evaluations of 
protein sequence comparison algorithms 113, 31, 26, 38']. 

It is generally useful to evalute not only the difference in performance of two search methods, but also 
whether such a difference is statistically significant. A number of procedures have been proposed, mostly 
based on bootstrap resampling with replacement il3 Ll26ll . In this context. Green and Brenner 11311 observed 
that large superfamilies have an undue influence on the results, as the number of possible relationships 
grows quadratically with the number of members in a superfamily. They therefore proposed two weighting 
schemes that reduce the influence of large superfamilies. Price et al. [26] noted technical challenges in 
obtaining accurate variances for the weighted statistics and proposed an improved bootstrap. 

Our query sets, based on Pfam and SUPERFAMILY, contain several models for each SCOP-classified 
superfamily. Some superfamilies are overrepresented both in the query sets and in the ASTRAL database. 
We propose a different method than Price et al. [26] to address the difficulties associated with having super- 
families of different sizes. Our strategy is to sample without replacement 3/4 of the superfamilies and then 
select a single model for each superfamily in any given query set. Hence, each sample contains no more than 
a single profile from each superfamily and therefore captures the most distant relationships among queries. 



2 Materials and Methods 
Software tools 

For HMM -based queries, we used the HMMER package (version 2.3.2) flQ], which is also used internally 
by Pfam. Local alignment between a sequence and an HMM is allowed by the nonzero probabilities of enter- 
ing match nodes directly from the begin state, as well as moving directly to the end state from them (Fig. [Hi. 
The statistical significance of each alignment score is estimated using an assumed extreme value distribution, 
with model-specific parameters. The final E- value, adjusted for model and sequence composition, is used 
to rank the hits. Another popular HMM platform is SAM H i, llll], which is used by SUPERFAMILY. 
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We used HMMER rather than SAM for all our HMM -based queries because the programs' retrieval perfor- 



mances were shown to be comparable l23l 13411 and because the SUPERFAMILY models were available in 
HMMER format. 

For PSSM-based queries, we used PSI-BLAST (version 2.2.17) The statistics of PSI-BLAST scores 



are based on the extreme value distribution ual with a correction for finite sequence length. The statistical 



significance of each database hit is refined by taking into account its composition as well as that of the PSSM 



m. 

PSI-BLAST allows one to start a search from a 'checkpoint' file containing a PSSM saved from an 
earlier PSI-BLAST run, or built by other means. In addition to a PSSM, PSI-BLAST requires gap penalties 
as input: a gap opening cost and a gap extension cost. The choice of gap penalties is restricted to a few 
values because the parameters required to produce accurate statistics are precomputed using large-scale 
simulations. For both HMMER and PSI-BLAST runs, we used the standard search exectutables with their 
default settings. 

Query sets 



Following Wistrand and Sonnhammer 113411 . we constructed a query set of Pfam (release 22.0) models by 
identifying all Pfam-A models that were cross referenced by Pfam with an identifier in SCOP 1.71, and 
mapping the cross-referenced SCOP identifier to a SCOP superfamily. We did not consider models that 
have multiple domains mapping to different superfamilies. 

We filtered the resulting set of Pfam models using two additional rules. First, any model mapping to a 
SCOP superfamily that had fewer than four members in ASTRAL 40 was removed from further considera- 
tion, to avoid superfamilies with a small number of members from disproportionally influencing the results. 
Next, we examined the MSA used to generate the Pfam profile and kept only those families whose MSA 
contained at least 10 sequences and had an average sequence length of at least 30 amino acids. Our final 
Pfam query set contained 703 Pfam models representing 299 superfamilies. We used the profiles from the 
P f am_f s set, built for local/local alignment. 

Our second query set consisted of all 6729 models from the SUPERFAMILY database (release 1.69) 
that belonged to the 299 superfamilies in the Pfam query set. These models were also built for local/local 
alignment. The above query sets, paired with HMMER, formed our first two search methods, which we 
named HOF (HMM, 'original', Pfam) and HOD (HMM, 'original', SUPERFAMILY). 

The second pair of search methods, called HBF and HBU (see Table[T]for an outline of all search meth- 
ods), was constructed by taking the HMMs from HOF and HOU, respectively, and replacing all emission 
scores for each insert state with 0. This is equivalent to setting all insertion emission probabilities to the 
background probabilities. 

We constructed the third pair of search methods, called HGF and HGU by taking the HMMs from HBF 
and HBU, respectively, and adjusting the state transition probabilities to correspond to those implied by the 
affine gap penalties used by PSI-BLAST (see Appendix for a detailed explanation). 

Let a denote the gap opening cost and /3 the gap extension cost, in bits. We used the default penalty of 
PSI-BLAST, which is 11 (a = 5.040 bits) for gap opening and 1 for gap extension {(3 = 0.458 bits). This 



scale was chosen to match the scale of BLOSUM62 11911 . the default scoring matrix of BLAST. 



For each position m of an HMM, we left the probabilities P{B — > Sm) and = P{Sm — > E) un- 
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Name 


Description 




HO 


Original HMM dataset. 




HB 


HMMs, background insertion emission probabilities 




HG 


HMMs, constant state transitions and background insertion 




emissions 




PO 


PSSMs, converted from original HMMs. 




PC 


PSSMs, from 5 PSI-BLAST iterations over nr using 
consensus seeds 


profile 


PS 


PSSMs, from 5 PSI-BLAST iterations over nr using 
domain sequence seeds 


SCOP 



Table 1 : Nomenclature of query sets. As shown in this table, the first two letters of the abbreviations of 
various search strategies denote the type of profile (HMM or PSSM), and the method of construction. The 
third letter is optionally appended to show the database of origin (F for Pfam, U for SUPERFAMILY). 



changed and set the remaining transition probabilities as follows: 

P{D^^Dra+l) = P{Im^Im) = V, (1) 
P{Dm^ Sm+l) = P{Im^ Sm+l) = — V, (2) 

P{Sm^ Dm+l) = P{Sm—^Im) = T~TT, 0) 

I + 2fj, — 1/ 

P{S^^S^,,)= ^'-lf'-'\ (4) 
1 + 2^ — u 

where ji = 2^^^ and v = 2^. The probabilities were read from HMMER files, converted from scores, 
modified and written back as scores, as per HMMER convention |9]. After modification, the HMMER 
statistical parameters of each HMM of HBF, HBU, HGF and HGU were recalibrated. 

The remaining search methods used PSI-BLAST with default gap penalties. POF and POD used PSSMs 
derived from HOF and HOU, respectively, by taking the match state emission probabilities and writing them 
in PSI-BLAST checkpoint format. PCF and PCU used PSSMs obtained using the standard PSI-BLAST 
iterative procedure. We obtained the consensus (most likely) sequences of POF and POD profiles and used 
them as seeds for the initial searches, running 5 iterations in total against nr, the database of non-redundant 



protein sequences maintained by NCBI (frozen on Apr 11, 2007) 113211 . 

The final search method, named PSU used the same construction procedure as POD except that the 
SCOP sequences associated with SUPERFAMILY models were used as PSI-BLAST seeds instead of profile 
consensus sequences. 



Performance evaluation 

As described above, our query sets contained no profiles assigned to more than one SCOP superfamily. Each 
pair p, s, where p is a query profile and s is an ASTRAL sequence, was classified as similar ('positive') 
if s belongs to the superfamily associated with p, and not similar ('negative') otherwise. For every query 
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(a) (c) 
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(b) (d) 




0.3 0.4 0.5 0.6 -0.3 -0.2 -0.1 0.0 0.1 

ROC SCORE ROC SCORE RELATIVE DIFFERENCE 



Figure 2: ROC score statistics of one million samples. In each sample, 224 superfamilies are first randomly 
chosen from 299 superfamilies. A representative query profile is then randomly selected from each chosen 
superfamily. ROC score histograms from using Pfam HMMs (a) and SUPERFAMILY HMMs (b) show ap- 
preciable difference in average ROC scores for each search method tested: SUPERFAMILY HMMs always 
perform better. Note that in panels (a) and (b), the curve for HO is completely covered by that for HB. Using 
HOF and HOU as baselines, the values of RRSD224 (measurement at 1 EPQ) between various methods and 
the baselines are computed for each sample. The resulting histograms are shown in panels (c) and (d). 



Pfc from a set of queries, denote by Np{pk) the number of ASTRAL 40 sequences belonging to the same 
superfamily as (i.e. the total number of positives for p^) and let Np = J2k ^p(Pfe)- 

Comparing each query profile to the ASTRAL 40 database, we retrieved a number of sequences ranked 
according to their E-values. These sequences were classified as true or false positives. For a given search 
strategy, after merging the results for the whole set of queries, we obtain the (step) functions p{E) and 
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f{E) giving respectively the cumulative numbers of true and false positives with E- value E or smaller. The 
function p can also be expressed as a function of /, the number of false positives and the graph of p{f) 
versus / is called the ROC (receiver operating chai^acteristic) curve HHH. The same curve can be 
displayed as a coverage vs. error-per-query (EPQ) or CVE plot. 

Our main performance statistic is the (truncated) ROC score. Given a number of false positives F, the 
ROC F score is defined by 
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F 



It represents the accuracy of the search method (given a set of queries) for a given number of false positives. 
To compare two search methods Mi and M2 we compute their relative ROCi? score difference, denoted 
RRSDiT, defined by 

RRSD (MM)- RQCf(Mi)-R0C^(M2) 

To overcome the aforementioned problems associated with overrepresentation of large superfamilies, 
we sampled according to the superfamily classification. For each sample we randomly picked 224 out of 
299 superfamilies (leaving 1/4 out) without replacement. Then, we selected one representative profile for 
each superfamily to form a sample query set. Search methods using the profiles originating from the same 
source (Pfam or SUPERFAMILY) used the same samples so that their performances could be compared for 
each sample. Our main statistic is the RRSD224 per sample, which measures performance at 1 EPQ or less. 
It allows a fair comparison of search methods. 



3 Results 

Fig. |2] shows the distributions of ROC224 scores and their relative differences (RRSD224) per sample with 
respect to HO for all query sets. Comparison of Fig. |2](a) and (b) shows that, in general, the strategies 
using profiles from SUPERFAMILY perform better than those using Pfam profiles. In terms of relative 
difference (Fig. [2] (c,d). Table ^, using both Pfam and SUPERFAMILY profiles, original HMMs (HO) 
perform significantly better than all other query sets except HB. There is no perceivable difference between 
HB and HO. There is also no significant difference between HG and PO. 

In the case of PSSMs, POD gives better performance than PCU and PSU, but there is no significant 
difference between POP and PCF, although PCF shows a large variance in performance. In a number of 
cases, a PCF sample even outperforms the corresponding HOP sample. The relative ROC score difference 
between PCU and PSU is slightly positive, but not significantly so. 

Using profiles from Pfam (SUPERFAMILY), we observed two (three) clusters of search strategies that 
performed equivalently based on RRSD224 (Fig.|2](c,d)). This trend in performance is supported by Fig.|3l 
which displays examples of CVE curves for all alignment methods tested. The samples associated with 
these CVE curves have the median ROC224 score. 
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(a) 


HOF 


HBF 


HGF 


POF 


PCF 


HOF 


0.0 


0.0 


0.0 


-0.1 


0.0 


0.1 


2.9 


4.5 


6.3 


2.4 


5.1 


8.1 


-0.6 


8.5 


19.5 


HBF 


-0.1 


-0.0 


0.1 


0.0 


0.0 


0.0 


2.9 


4.5 


6.3 


2.4 


5.1 


8.1 


-0.6 


8.5 


19.4 


HGF 


-5.9 


-4.3 


-2.8 


-5.9 


-4.3 


-2.8 


0.0 


0.0 


0.0 


-1.7 


0.5 


3.0 


-4.9 


3.8 


14.1 


POF 


-7.5 


-4.8 


-2.3 


-7.5 


-4.8 


-2.4 


-3.0 


-0.5 


1.8 


0.0 


0.0 


0.0 


-5.1 


3.2 


13.1 


PCF 


-16.3 


-7.8 


0.6 


-16.3 


-7.8 


0.6 


-12.3 


-3.7 


5.1 


-11.6 


-3.1 


5.3 


0.0 


0.0 


0.0 



(b) 


HOU 


HBU 


HGU 


POU 


PCU 


PSU 


HOU 


0.0 


0.0 


0.0 


-0.4 


-0.0 


0.4 


2.9 


6.9 


10.2 


2.0 


5.3 


8.2 


12.4 


21.4 


30.3 


15.0 


24.5 


34.2 


HBU 


-0.4 


0.0 


0.4 


0.0 


0.0 


0.0 


2.9 


6.9 


10.2 


2.1 


5.3 


8.2 


12.4 


21.4 


30.3 


15.0 


24.5 


34.2 


HGU 


-9.3 


-6.4 


-2.8 


-9.3 


-6.4 


-2.8 


0.0 


0.0 


0.0 


-4.1 


-1.4 


1.4 


7.0 


13.6 


20.8 


9.6 


164 


24.3 


POU 


-7.6 


-5.0 


-2.0 


-7.6 


-5.1 


-2.0 


-1.4 


1.5 


4.3 


0.0 


0.0 


0.0 


8.1 


15.3 


22.8 


10.7 


18.2 


26.3 


PCU 


-23.3 


-17.6 


-11.0 


-23.3 


-17.7 


-11.0 


-17.2 


-12.0 


-6.6 


-18.6 


-13.3 


-7.5 


0.0 


0.0 


0.0 


-1.5 


2.5 


7.2 


PSU 


-25.5 


-19.6 


-13.0 


-25.5 


-19.7 


-13.1 


-19.5 


-14.1 


-8.8 


-20.8 


-15.4 


-9.7 


-6.7 


-2.4 


1.5 


0.0 


0.0 


0.0 



Table 2: Summary of statistics of RRSD224 between every pair search strategies using the same source. 
In Fig. [2](c) and (d), HOF and HOU were used as the baselines for Pfam and SUPERFAMILY search 
strategies, respectively, and the histograms of RRSD224 relative to the baselines are shown. It is impractical 
to show such histograms for all possible baselines. However, for each pair of search strategies, we may sort 
(in ascending order) their one million values of RRSD224 and record the corresponding RRSD224 value at 
various designated percentiles. In the table, there are three numbers in a row for any given pair of search 
strategies. As an example, the numbers 2.9, 4.5 and 6.3, associated with Mi = HBF and M2 = HGF, 
are located in the row labelled by HBF and within the column headed by HGF. Those numbers, when 
divided by 100, have the following interpretation: the leftmost corresponds to the RRSD224 value at the 
2.5-th percentile, the middle to the median and the rightmost to the 97.5-th percentile. Subtable (a) records 
the numbers associated with Pfam search methods, while subtable (b) documents those associated with the 
SUPERFAMILY strategies tested. 

4 Discussion and Conclusion 

The clear separation in retrieval performance between the SUPERFAMILY and Pfam profiles could be 
explained by the fact that the former are based on ASTRAL sequences, which form our testing set as well. In 
contrast, Pfam models are based on a variety of sequence sources and were not trained on ASTRAL. Hence, 
a degree of overfitting the SUPERFAMILY models to the testing set, as well as the fact that ASTRAL is 
structure based, may explain the overall differences in performance. 

Another interesting observation is that CVE curves (Fig. O cross at low EPQ and form distinct clusters 
above 0.5 EPQ. Due to small sample size, the coverage at low EPQ is expected to have a larger uncertainty, 
thus the crossing of CVE curves there is anticipated. At moderate EPQ, the distinct clusters indicate that the 
relative retrieval efficiency is not influenced by the choice of EPQ. 

On both testing collections, we have observed almost no difference in performance between the original 
HMMs (HO) and the models derived from them having insertion emission probabilities reset to the back- 
ground (HB). Examining the models in HMMER format, we found that the insertion emission distributions 
were almost constant over all the positions, with the common distribution being slightly biased in favor of 
hydrophilic amino acids. The average relative entropy between this distribution and the background distri- 
bution is very small (0.037 bits for Pfam, 0.005 bits for SUPERFAMILY), explaining the very small effect of 
the insertion emissions on the retrieval performance. Note that SUPERFAMILY models had higher overall 
probabilities of entering a gap state and hence showed a larger influence of insertion emissions than Pfam 
models (Fig.|2](c,d)). 
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Figure 3: Example CVE curves for various search strategies based on Pfam (top) and SUPERFAMILY 
(bottom) profiles. Each curve shown is a representative that corresponds to a sample with ROC224 score 
equal to the median of 1,000,000 samples. 



In addition, an insertion emission distribution biased in favor of hydrophilic amino acids may not be 
appropriate for all positions within proteins: it implicitly assumes the globular protein structure, with hy- 
drophobic core and hydrophilic surface. Finally, from an information theoretic point of view, it is very diffi- 
cult to reliably estimate insertion emission probabilities. In particular, if one wishes to establish an emission 
model whose emission probabilities are similar to those of the background and wants to confidently distin- 
guish those two sets of probabilities, it is necessary to have a large amount of data. The following example 
illustrates this point. 

In the Pfam insertion emission model. Leucine's emission probability, 0.0676, has the largest deviation 
compared to the background 0.0934. Consider a simple coin tossing experiment where the probability of 
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seeing a leucine (head) isp = 0.0676 and the probability of seeing any other amino acid (tail) is 1 — 0.0676. 
One may ask how many tosses (number of amino acids present in a gap column of an MSA) are needed in 
order to confidently rule out the possibility that the probability is 0.0934. It is well known that a binomial 
distribution in the large number limit becomes a Gaussian. In our example, the probability of observing k 
heads out of n tosses becomes 



V 2-7r n 

To reject with 85% confidence the value of 0.0934 as the probability of seeing a head, the absolute difference 
between the two probabilities, 0.0934 and 0.0676, must be greater than or equal to 1.02 times the standard 
deviation, \/p{l — p)/n. This leads to 



/ 0.0847 

0.0258 > 1.02a/ ^ n > 132 . 

V n 

When applied to estimating insertion emission probabilities, this example implies one needs to have about 
132 amino acids in a gap column of a multiple alignment. This number seems large for columns associated 
with an insert state, as these columns normally have more gaps than amino acids. On the other hand, we can 
confidently determine emission probabilities for columns that contain mostly amino acids and are therefore 
usually assigned to substitution states. Furthermore, the dominant amino acid in a match column often 
has very different observed and background frequencies. For example, consider a match column with 20% 
leucine. The same calculation as above tells us that we need only eight or more amino acids in the match 
column to indicate a preferrence for leucine. Of course, considering the sub-dominant amino acids requires 
more entries in the match column. 

Comparing HO to HG and PO, we see that profiles with position-dependent gap parameters have 
5% better retrieval performance (as measured by the median RRSD224 value) than those with position- 
independent ones. This is an area where HMMs are clearly superior to the PSSMs with constant gap penal- 
ties, as used by PSI-BLAST. Hence, a possible direction for improvement of PSI-BLAST is to introduce 
position-dependent gap parameters. When interpreting this difference, one should note that we did not opti- 
mize the PSI-BLAST gap penalties, but use only the defaults. It is therefore possible that the performance of 
PSI-BLAST with a better set of gap opening and extension penalties would more closely match the perfor- 
mance of HMMs. Another possibility is to estimate and optimize gap parameters for each PSSM separately, 
at the time of its creation (that is, each PSSM would still carry a single, position independent, gap opening 
and gap extension penalty, but they would not be input beforehand but estimated from the data). The prac- 
tical problem with these suggested improvements is that the statistical parameters for position-specific gap 
penalties cannot be quickly computed as yet, and one is therefore restricted to the costs for which the pa- 
rameters have been precomputed. Another possibility is to modify PSI-BLAST to use the hybrid alignment 
algorithm fi^, '3?], which is probabilistic, naturally accepts PSSMs with position-specific gap costs, and has 
well-chai-acterized, universal statistics. 

It is not surprising that the performances of HG and PO show no significant difference because HG 
was designed to simulate the PSI-BLAST gap parameters in the HMM framework. Some differences still 
exist due to a fundamental difference between the underlying algorithms. First, although the score statistics 
for HMMER and PSI-BLAST are both based on the extreme value distribution, there are still differences 
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in details. Second, PSl-BLAST alignments may have longer segments of ungapped alignment because 
the score associated with ungapped alignment is not reduced by the probability of entering another node. 
Some difference can also be explained by slightly different background probabilities in each case. Finally, 
local alignment is achieved through different mechanisms: PSI-BLAST alignments terminate when their 
accumulated score is maximal, while HMMER alignments terminate only when they hit the end state. Thus, 
HMMER alignments may tend to be more global with respect to the profile. 

The difference in performance of PSI-BLAST using PSSMs constructed in different ways shows that 
focusing on profile construction as well as on position-specific gaps may yield significant improvement. In 
particular, the performance of PSSMs converted from HMMs (PO) versus those iteratively constructed (PC 
and PS) shows that a more carefully constructed profile may yield better performance, with the difference 
being more pronounced in SUPERFAMILY than in Pfam. The fact that the PSSMs obtained iteratively from 
nr based on SUPERFAMILY consensus seeds generally perform better than those originating from Pfam 
consensus seeds shows the importance of the choice of the initial seed sequence. This is further emphasised 
by the slightly better performance of the PSSMs based on the consensus sequence as seed (PCU) than 
the performance of those based on the seeds taken from ASTRAL (PSU). Hence, another possible way of 
improving PSI-BLAST would be to run one iteration using the normal scoring matrix and construct a profile 
as before, but then to rerun the search using the consensus sequence as the seed instead of proceeding into 
the iterative stage with the profile. In that way, a more 'central' seed can be obtained, which, while not 
corresponding exactly to any sequence present in the dataset, may yield a more accurate profile for the 
iterative steps. Naturally, the choice of the weighting scheme for the multiple aUgnment used to obtain the 
consensus sequence or profile as well as the associated pseudocounts will also exert a significant influence 
on the result. 

Finally, our methodology must be understood in the context of the small size of the testing suite. This 
does not present a significant problem when testing different parameter sets of the same aUgnment algorithm 
but when comparing different algorithms, it is essential to eUminate bias due to superfamily size. Our 
approach, based on sampling 3/4 of the superfamiUes without replacement, was designed with this aim in 
mind. 
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Appendix 



The connection between the transition probabilities of HMMs for sequence evolution and the scoring func- 
tion (scoring matrices and gap parameters) used in sequence comparison is elaborated in this a ppe ndix. 
Since such a connection has been sketched explicitly in earlier publications on hybrid alignment 11361 1371]. 
interested readers are encouraged to look into the original literature. We present a self-contained exposition 
here to save the reader some effort in reading through earlier papers, and to present a minor extension needed 
for aligning a protein sequence to a local HMM with explicit termination probabilities at its nodes. Note that 
keeping a nonzero termination probability is how HMMER achieves local alignments. Hybrid alignments 
achieve a local alignment by taking the maximum of the log-odd ratios at each possible termination point, 
and hence do not need to deal explicitly with the termination probabilities of the HMMs. 

The fundamental idea of protein sequence comparison is rooted in the amino acid score (substitution) 
matrix, where the {i,j)th entry 



1 
A 



In 



Q 



Pipj 



(V) 



is the log-odd ratio of the joint probability Qij of amino acids i and j in the target ensemble to the product 
of the background probabilities, pi and pj, of the two amino acids. Here A is just a scale and is set to unity 
from this point on. For a valid scoring matrix |39], one has pi = ^ Qij and one may express Qij as 



Qij=PiT{j\i) =PjT(i\j), withr(jli) 
In this case, we may also write 



being the probability for amino acid i to mutate into amino acid j. 

mi) 



In 



Pi 



(8) 
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which may now be viewed as the log-odds ratio of a conditional emission probability to the background 
probability. 



Extending this concept 1361. 13711. one may score the global relatedness (alignment) between two protein 
sequences, a and b, the same way: using the log-odds ratio of (5[a, b] to P[a]P[b] (the background prob- 
ability of generating a pair of random sequences a and b). In terms of global relatedness, Q[a, b] may be 
regarded again as i-*[a]T[b|a] and 

Q[a,b] ^ r[H^ 

P[a]P[b] P[h] • ^ ' 

Here ^[bla] is the probability for sequence a to mutate into sequence b. It is not hard to convince oneself 
that there are many different "ways" or "paths" for sequence a to mutate into sequence b. In fact, it has been 
argued that the usual optimal alignment corresponds to the most probable evolutionary path. In this context, 
the gap cost is related to the transition probabilities in and out of the insertion/deletion states of the HMM. 

A protein HMM consists of a number of nodes. Except at the begin node, each node j allows two pos- 
sible states, substitution (S) and deletion (D). The substitution state associated with node j is characterized 
by the transition probability from aj to other amino acids. The deletion state is further divided into cases 
depending on its preceding state. In between two nodes, one can have an insertion (I) state. The transition 
probabilities from a given state to all other allowed states have to sum to one. Four transitions - 5 — > 5, 
S —>■ D, D ^ S, and D ^ D - will each advance the node index by one. Transition S —>■ I and I S com- 
bined together increase effectively the node index by 1, while the transitions / — and D—^I (if allowed) 
do not change the node index at all. In many HMMs, such as the ones used by HMMER, the transitions 
between / and D states are strictly forbidden and we follow this rule in here to simplify our exposition. 




Figure 4: An example of a partial alignment between a profile HMM and a protein sequence. Note that in 
the text, the state preceding 5i is assumed to be a substitution state. 



Constrained by the probability conservation condition, the transition probabilities are usually made 
node-specific (or equivalently termed position-specific). Focusing on the substitution scores of protein 
HMMs, position-specific scoring simply means that the substitution states at different nodes may emit amino 
acids with different sets of probabilities. 

As a concrete example, let us consider an alignment of a partial HMM model of eight nodes aligned with 
a sequence b = [6i , 62 j • • • , ^7] of seven amino acids. Their alignment is shown in Fig. |4l The alignment 
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score S is given by 

S = + 52(62) +5g_6) +57(^3) 

+gU{[bA,b5,bG]) + ssibj) (10) 

where Si{bj) represents the substitution score for amino acid j at node i, 5^_g) represents the gap score 
associated with deleting nodes 3 through 6, and (7^4. ( [64 , 65 , 65] ) represents the gap score associated with 
inserting amino acids 64,65 and ftg between nodes 7 and 8 of the HMM. The superscript "+" associated 
with the I state will be suppressed from this point on. The probability of occurrence associated with this 
alignment A may be written as 

r4[b|a] = P{So ^ Si)Ta, (6i)P(5i ^ 52)T„, (62)^(^2 ^ D^) 

Ta, (63)P(57 ^ h)p{bi)P{h ^ h)p{b,,) 
P{Ir^l7)p{be)P{l7^Ss)Ta,{b7), 

where p{b) is the insertion probabihty of amino acid b between nodes 7 and 8. Assuming that P[b] = 
]^^p(6j), one obtains the ratio 

-T[^-^^^'^^'^^{b;y^^^'^^'^^{b2V^^^'^''^ 

P{D3 ^ D4)P{D4 ^ D5)P{D5 ^ D^)P{De ^ S7) 

P{h^h)^,P{h^Ss)^^. (11) 
P(66) ^(67) 

Comparing (fTOl) and ([TT]) and events of similar type yields the following mappings: 

exp[5,(6,)] = P(5,_i^5,)ZkM 

expbg_6)] = P{D3^D^)P{D^^D.,)P{D5^Dg) 
PiS2^D3)P{De^S7) 



exp[g7([64,65,66]) = pi^g^^Sg) 



^^(56^ 57) 
rW7^^ 

7^5'8) 

, p{bi) p{b->) p{bQ) 
' p{bi) p{b->) pibo) 



(12) 



Frequently HMMs take P{Di^i Di) and P{Ij — > Ij) each to be a constant. In this case, the ra- 
tio ^p^g'^J^p^j^/^ ^ contributes to a position specific gap opening cost and the ratio contributes to a 
composition-dependent insertion cost. The quantity P{I I) contributes to the insertion gap extension 
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cost. If one keeps emission probabilities . (6) node-dependent, but demands that all the state-to-state tran- 
sition probabilities be node-independent, one essentially has a PSSM with uniform affine gap costs, although 
possibly with composition-specific insertion gap costs if p is chosen to be different from the background p. 

Since the transition probabilities are constrained by the respective conservation conditions, and those 
probabilities are related to the scoring function through (fT2l) . the substitution and gap scores are no longer 
independent if one wishes to have a probabilistic interpretation. We now turn to the relationship among score 
parameters when the state to state transition probabilities are node-independent constants. Let t] = P{S — > 
S), /i^i = P{D -> S)/P{S ^ S), = ^ /))^ ^/i = p(j ^ S)/P{S S), /2 = p(5 ^ /), 

= P{I I), and = P{D D). Because and always appear together as a 

product, we further define fi^ = fi^^fi^"^ (fi^ = fi^^jj.^^). The probability conservation condition then 
demands that 

rj + H^^ + fi^^ = 1 (13) 
ri^^^ + u^ = 1 (14) 
7?/i^l + Z^^ = 1. (15) 

Treating u^, u^, fi^ and /i^ as fixed parameters allows us to express rj, /i^^, and fi^^ in terms of us and 
jji^i^) . To do so, we multiply ([141 ) by and multiply ([Tsl ) by Together with ([T3] ). we have three linear 
equations with three unknowns: r], /i^^, and Solving these equations yields 



1 



1 + /,//(!_ J,/) +^D/(i_^D) 

For the case /i^ = /i^ = /i and = = v, these expressions simplify to 



1 - 



1 + 2^-1/ 



Note that with this notation, we may rewrite (1121) as 

exp[si(6j)] = r/ 



1 + 2;U 



^(3-6) 

exp [57 ([64, 65,66]) 



2 ^ p{hi) p{h) pjh 



It becomes evident that hi(^/i/) corresponds to the gap opening score while In(z^) corresponds to the gap 
extension score, and \n{p{b) / p{b)) becomes an additional composition-specific insertion score. 
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In HMMER, the local alignment is terminated by going into the end state, and the end state can be 
reached only from substitution states. In this context, the probability conservation equations ([141 ) and ( fT5] ) 
remain unchanged. However, we may allow a node-specific termination probability from the S state. This 
requires the introduction of a position index for the other transition probabilities. Let 7]m = P{Sm —>■ Sm+i), 

Cm = P{Sm E), = P{Dm —>■ Sm+l)/ P{Sm ~^ Sm+l), Mm^ = P{Sm ~^ Dm+l)^ l^m = P{^m ~^ 

Sm+i)/P{Sm Sm+i), Mm = P{Sm ^ Im)- Howcvcr, notc that P{Im ^ Sm+i) should remain the same, 
because there is no direct transition E. Thus, we may still keep both ^^^^ = fJ-mfJ-m = M ^^i^ 

= ^m — ^ constants. The probability conservation condition then yields 

1 (16) 
1 (17) 
1, (18) 

the solution of which is 

{l-em){l-y) 
l + 2fi-u 
- em) 
1 + 2/i - 

Although /i^^ and /x^ are decreased, /x^^/i^^ and ^JL^I-^-m kept the same as before. As a consequence, 
the only change is that the substitution score at each node is reduced by a node-specific constant ln[l /(I — 
Cm)] when it is not preceded by a gap state. If an alignment has deletion at node m followed by k more 
substitutions from node m + 1 to node m + k, then the substitution score reduction starts only at node m + 2 
and persists to node m + k. 



Vm ~l~ Mm ~^ l^m ~^ ^m — 
??mMm = 



Vm 



Mm 



H'm 
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